IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 11, No. 1, March 2022, pp. 319~326
ISSN: 2252-8938, DOI: 10.11591/ijai.v11.i1.pp319-326


Wiki sense bag creation using multilingual word sense 
disambiguation 


Shreya Patankar1, Madhura Phadke2, Satish Devane3
1,2Department of Computer Engineering, Datta Meghe College of Engineering, Navi Mumbai, India
3Department of Information Technology, Datta Meghe College of Engineering, Navi Mumbai, India 


Article Info

Article history:
Received Jul 1, 2021
Revised Dec 22, 2021
Accepted Jan 2, 2022

Keywords:
Multilingual
Natural language processing
Word sense disambiguation

ABSTRACT

Word sense disambiguation (WSD) is one of the challenging tasks in the area
of natural language processing (NLP). Generation of a sense annotated corpus
for multilingual word sense disambiguation is out of reach for most languages
even if resources are available. In this paper we propose an unsupervised
method using word and sense embeddings for improving the performance of these
systems using untagged corpora, and we create two bags, namely an ontological
bag and a wiki sense bag, to generate the senses with the highest similarity.
The wiki sense bag provides the external knowledge the system requires to
boost the disambiguation accuracy. We explore Word2Vec model to generate the
sense


This is an open access article under the CC BY-SA license. 


Corresponding Author: 


Shreya Patankar 

Department of Computer Engineering, Datta Meghe College of Engineering 

Sector-3, Airoli, Opp Khandoba Temple Sri Sadguru Vanamrao Pai Marg, Navi Mumbai, Maharashtra, 
India 

Email: snp.cm.dmce@gmail.com


1. INTRODUCTION 

Increasing user demand for access to text data in various languages opens the door to multilingual
natural language processing (NLP), and word sense disambiguation (WSD) has proved to be a key step in
improving the performance of many NLP systems. The accuracy of word sense disambiguation systems is far
from satisfactory, and multilingual WSD has not achieved satisfactory results due to insufficient resource
availability [1]. The availability of multilingual dictionaries has enhanced sense disambiguation using
multilingual content, which demonstrates the need for multilingual WSD [2]. It also opens up a different
way of approaching multilingual WSD by making use of BabelNet, a wide ontological structure exploring
semantic knowledge. This is the motivation for working on multilingual word sense disambiguation by
exploring the available resources.

Relying only on a multilingual knowledge-based system may hamper the growth of WSD systems. Although
multilingual dictionaries provide wide coverage by exploring the interconnected ontology structure,
various issues remain: proper nouns are not part of the dictionary, and the correlation between the most
frequent words and rare contextual words lacks dictionary coverage. External knowledge in the form of raw
text is therefore needed, and it is provided using word and sense embeddings [3]. Our research makes use
of word and sense embeddings to create a semantic word cloud by designing a wiki bag in addition to the
sense bag. The wiki bag is designed using Wikipedia, as it is the largest encyclopedia and covers most of
the data essential for disambiguation. The paper is organized as follows: section 2 presents the
literature review, which highlights the work of various researchers; section 3 describes the proposed
methodology, which includes working with multilingual input, the multilingual dictionary BabelNet and the
working of the WSD engine. Section 4 focuses on results and discussion, and section 5 concludes.


2. LITERATURE REVIEW 

The Word2Vec model [4]-[13] provides an efficient tool for estimating a vector model from a corpus.
A sense bag was created in [14] making use of dictionary resources such as synset members, example
sentences, and hypernymy and hyponymy subsets. A survey on WSD was presented in [15], highlighting the
motivation for solving the ambiguity of words and providing a description of the task. The concept of
word sense disambiguation in a multilingual setting was introduced in [16], making use of the large
encyclopedic ontological network BabelNet. The precision achieved was 54.3% when tested on the SemEval
2010 dataset. In 2013, Aziz and Specia [17] discussed expressing meanings in terms of paraphrases.

The role of WSD in a multilingual NLP scenario was surveyed using the English-Spanish language pair
[18]. WSD in multilingual machine translation (MT) is based on the concept that a resource-rich language
helps a low-resource language by projecting parameters such as sense distributions and corpus
co-occurrences [19]. The accuracy observed was 75% for three languages with a domain specific corpus. WSD
in NLP applications is also discussed in [20]. Cross-lingual WSD systems were discussed in [21] and
evaluated on the SemEval 2010 task. Machine translation is one of the important applications of WSD and
is discussed in [22], [23]. A survey of text classification for the Kurdish language is presented in
[24]-[27], where a stemmer algorithm was applied to find the stem before classification. A network
approach to WSD, sentiment analysis and a survey are explored in [28]-[31]. To the best of our knowledge,
not much work has been reported on WSD in a multilingual setting, and it needs to be explored using
various state of the art WSD methods.


3. PROPOSED METHODOLOGY 

The proposed methodology is presented in Figure 1. Section 3.1 presents the concept of representing
multilingual input data: the engine accepts multilingual input, which benefits disambiguation.
External knowledge is also provided to the system using sense embeddings.

Figure 1. Proposed multilingual word sense disambiguation (WSD) framework


3.1. Multilingual input 

We consider input from various languages such as German and French and make use of the BabelNet
multilingual dictionary described in section 3.2. Drawing on other languages improves the system
accuracy: a word that is ambiguous in one language may not be ambiguous in another language, and this
helps the engine to disambiguate.


3.2. BabelNet 

BabelNet is a huge multilingual ontological network incorporating lexical, semantic and
syntactic knowledge from various languages [1]. It is represented as a labelled graph in which the
edges specify semantic relations between the nodes. It combines the knowledge of WordNets in various
languages with the largest multilingual encyclopedia. Section 3.3 describes the working of the WSD
engine together with its algorithm.


3.3. Word sense disambiguation (WSD) engine

The WSD engine takes the multilingual input by exploring various languages at the same time. It
combines the translations of the target word and of the other context words to produce more accurate
sense predictions. Sense disambiguation begins by gathering the data required for disambiguation: the
different senses of the ambiguous word are collected from BabelNet in S, the set of synonym sets
(synsets). Context words are collected in Ctx, and the algorithm then proceeds by picking up the
multilingual translations of the ambiguous word and of the clue words, stored in Tx and Ty
respectively. Translations are considered in French and German, as foreign languages are explored. The
algorithm iterates through each synset s ∈ S to collect the translations of each of its senses [7].

The algorithm also iterates through each context word ci ∈ Ctx to collect in Ty its sense-specific
German and French translations. An element ti is selected from Tx and an element tj is selected from Ty,
and a multilingual context Ctx′ is created by combining ti and tj with Ctx. Ctx′ is then used to build a
graph G = (V, E) by computing the paths in BabelNet which connect the synsets of ti with those of the
other words in Ctx′, as shown in Figure 2. By selecting a different element from T at each step, a new
graph is created in which different sets of Babel synsets are activated by the context words in Ctx. The
result of this procedure is a subgraph of BabelNet containing the senses of the words in the context and
all edges and intermediate senses found in BabelNet along all paths connecting them. Figure 2 shows the
disambiguation graph created to disambiguate the English target word ‘bank’. In the graph, some of the
possible senses of this word are activated, including the correct sense of bank (English) but also a
related yet incorrect one.

Figure 2. Disambiguation graph for English language
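The translation-gathering step described above can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the dictionaries `SENSES` and `TRANSLATIONS` are hypothetical stand-ins for BabelNet lookups, and the function names are our own. It shows how Tx, Ty and Ctx could be combined into one extended context per (ti, tj) pair.

```python
from itertools import product

# Hypothetical toy lexicon standing in for BabelNet lookups (not the real API).
# (word, language) -> candidate synset ids
SENSES = {
    ("bank", "EN"): ["bank_finance", "bank_river"],
    ("money", "EN"): ["money"],
}
# synset id -> {language: translation}
TRANSLATIONS = {
    "bank_finance": {"DE": "Bank", "FR": "banque"},
    "bank_river": {"DE": "Ufer", "FR": "rive"},
    "money": {"DE": "Geld", "FR": "argent"},
}


def collect_translations(word, lang="EN", targets=("DE", "FR")):
    """Return (translation, language, synset) triples for every sense of `word`."""
    triples = []
    for synset in SENSES.get((word, lang), []):
        for target_lang in targets:
            translation = TRANSLATIONS.get(synset, {}).get(target_lang)
            if translation is not None:
                triples.append((translation, target_lang, synset))
    return triples


def multilingual_contexts(target, context_words, lang="EN"):
    """Yield one extended context per (ti, tj) pair drawn from Tx and Ty."""
    tx = collect_translations(target, lang)  # Tx: translations of the target
    ty = [t for c in context_words for t in collect_translations(c, lang)]  # Ty
    ctx = [(c, lang) for c in context_words]  # Ctx: original context words
    for ti, tj in product(tx, ty):
        # Keep only (word, language); the synset id is dropped here.
        yield ctx + [ti[:2], tj[:2]]


contexts = list(multilingual_contexts("bank", ["money"]))
```

Each element of `contexts` is a list of (word, language) pairs that would then be used to activate synsets in the graph; enumerating all pairs with `product` is one simple choice among several.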


3.4. Scoring distribution

Scoring distribution is calculated using the Inverse path length sum measure. It scores each sense by 
summing over the inverse length of all paths which connect it to other senses in the graph. It is very useful 
for sense disambiguation and improves the accuracy. 


\[ score(s_j) = \sum_{p \in paths(s_j)} \frac{1}{e^{length(p) - 1}} \qquad (1) \]


where paths(sj) is the set of simple paths connecting sj to the senses of the other context words and
length(p) is the number of edges in the path p; each path is scored with the exponential inverse decay of
its length. Scores are calculated and stored in Ascore. In the final step, the cosine distance similarity
measure is calculated to find the maximum score, which determines the closeness between the ambiguous
word and the context words. The cosine distance formula is presented in (2):


\[ \cos(S, SC(T)) = \frac{\sum_{i=1}^{n} S_i \, SC(T)_i}{\sqrt{\sum_{i=1}^{n} S_i^2} \, \sqrt{\sum_{i=1}^{n} SC(T)_i^2}} \qquad (2) \]


where S is the vector representing the score of the ambiguous word and SC(T) is the vector representing
the score of the context words. The global score is obtained by selecting the highest score: as a result
of executing the algorithm, the maximum of the scoring distribution is returned to select the best
disambiguation sense. Sections 3.5 and 3.6 describe the use of deep learning tools to represent the
dictionary framework in numeric form. Table 1 shows the scoring distribution obtained with the above
formula for the two senses of bank (English), namely the building sense and the financial institution
sense.


Table 1. Scoring distribution


Language           bank (English), financial institution    bank (English), building
bank (EN)          0.666666666                               0.3333333333
bank (German)      0.333333333                               0
banque (French)    0.444444444                               0
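A minimal sketch of the inverse path length sum in (1) is given below. The toy graph, the sense identifiers and the `max_edges` cut-off are assumptions made for illustration; the paper does not state a path length limit, and the sketch does not attempt to reproduce the values in Table 1.

```python
import math

# Hypothetical undirected toy graph standing in for a BabelNet subgraph.
GRAPH = {
    "bank_finance": ["money", "deposit"],
    "deposit": ["bank_finance", "money"],
    "money": ["bank_finance", "deposit"],
    "bank_river": ["river"],
    "river": ["bank_river", "water"],
    "water": ["river"],
}


def simple_paths(graph, start, goals, max_edges):
    """Yield simple paths from `start` that end at the first node found in `goals`."""
    stack = [[start]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node in goals and len(path) > 1:
            yield path
            continue
        if len(path) - 1 >= max_edges:
            continue  # assumed cut-off on the number of edges
        for neighbour in graph.get(node, ()):
            if neighbour not in path:  # keep the path simple (no repeated node)
                stack.append(path + [neighbour])


def inverse_path_length_sum(graph, sense, context_senses, max_edges=3):
    """Equation (1): sum of 1 / e^(length(p) - 1) over the paths found."""
    return sum(
        1.0 / math.exp((len(path) - 1) - 1)  # len(path) - 1 is the edge count
        for path in simple_paths(graph, sense, context_senses, max_edges)
    )


scores = {
    sense: inverse_path_length_sum(GRAPH, sense, {"money"})
    for sense in ("bank_finance", "bank_river")
}
best = max(scores, key=scores.get)
```

In this toy graph the financial sense reaches the context sense by one path of one edge and one path of two edges, while the river sense reaches it by none, so `best` is the financial sense.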


3.5. Synset dictionary framework 

Our study explores the ontology of each sense definition from the dictionary, namely hypernym,
hyponym, holonymy, and gloss, as synset members alone are not sufficient for identifying the correct
sense. One reason is that some synsets have a very small number of synset members; the other is to
reduce the topic drift which may occur because of polysemous synset members. It is also observed that
adding the gloss of the hypernym/hyponym gives better performance than adding the synset members of the
hypernym/hyponym [5].


3.6. Word and sense embedding 

There is a need to bring the clue words and the ambiguous words together, which is done using word
embeddings. Words are embedded in a continuous vector space of lower dimension, and the word embeddings
are trained using the word2vec tool [4]. The training proceeds by presenting different context-target
word pairs from the corpus, thus preparing an ensemble model for all the ambiguous words in the
vocabulary, as presented in Figure 3. The corpus ensemble model of vectors represents the closeness of
the context-target pair for a specific sense. To the best of our knowledge, this is the first attempt of
its kind to generate a sense-specific word vector model which represents the proximity between the
context words and the ambiguous word in the vector space. Section 3.7 presents our contribution of sense
bag creation.

Figure 3. Corpus based ensemble vector model
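To make the notion of context-target pairs concrete, the following sketch enumerates the pairs that a window-based embedding model would see for one sentence. The window size of 2 and the example sentence are assumptions for illustration; the paper does not state the window size or whether the skip-gram or CBOW variant was used.

```python
def context_target_pairs(tokens, window=2):
    """Return (target, context) pairs for every token within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


# Hypothetical example sentence containing the ambiguous word "bank".
sentence = "she deposited money in the bank".split()
pairs = context_target_pairs(sentence, window=2)
bank_pairs = [p for p in pairs if p[0] == "bank"]
```

With a window of 2, the clue word "money" does not fall inside the context of "bank" in this sentence, which illustrates why distant clue words may need additional knowledge sources to be brought close to the ambiguous word.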


3.7. Sense bag creation 

The sense-specific vector model is built by extracting features from the lexical ontology as well
as from encyclopedic knowledge. Words are represented by retrieving the context words from the
ontological structure of each sense, such as synset members, the gloss or example sentence, and relations
such as hypernym or hyponym. The Word2Vec model is a layered neural network that processes text by
converting words into vectors, a numerical form which brings related words together. The input to the
neural network is a window of words, the hidden layer consists of a weight matrix, and the output is the
vector representation of the words. A wiki sense bag is also created, which is the vector representation
of the Wikipedia text for the ambiguous words. This provides additional world knowledge to the word sense
disambiguation engine, as the wiki sense bag covers most of the vocabulary needed to bring context-target
pairs closer in the vector space. Wiki bag creation is shown in Figure 4, and the similarity measure is
described in section 3.8.
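The following sketch shows one way a sense bag could be turned into a single vector. It is an illustration under stated assumptions, not the authors' procedure: the feature texts and the two-dimensional embeddings are invented, and averaging the word vectors of the bag is our own simplifying choice, since the paper does not specify how the bag is aggregated.

```python
def build_sense_bag(features):
    """Flatten the feature texts of one sense into a list of lowercase tokens."""
    bag = []
    for text in features.values():
        bag.extend(text.lower().split())
    return bag


def bag_vector(bag, embeddings):
    """Average the embeddings of the bag words that are in the vocabulary."""
    vectors = [embeddings[w] for w in bag if w in embeddings]
    if not vectors:
        return None  # no bag word is covered by the embedding vocabulary
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Hypothetical two-dimensional embeddings and feature texts, for illustration only.
EMBEDDINGS = {"money": [1.0, 0.0], "deposit": [0.0, 1.0]}
ontology_features = {
    "synset_members": "bank deposit",
    "gloss": "institution that holds Money",
}
ontology_bag = build_sense_bag(ontology_features)
ontology_vec = bag_vector(ontology_bag, EMBEDDINGS)
```

A wiki sense bag could be built the same way by passing Wikipedia text for the sense as an additional feature entry.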


3.8. Similarity measure 
The similarity measure is calculated as the cosine similarity between the word representation of the
context vector and the sense bag representation. It generates a similarity score which helps in the
disambiguation process. The cosine similarity measure has proven to be useful in the word sense
disambiguation process.


\[ \cos(vec(w), vec(SB)) = \frac{\sum_{i=1}^{n} w_i \, SB_i}{\sqrt{\sum_{i=1}^{n} w_i^2} \, \sqrt{\sum_{i=1}^{n} SB_i^2}} \qquad (3) \]


where vec(w) is the word embedding for word w, SB represents the sense bag and vec(SB) is the sense
embedding representing the combined score of the ontology bag and the wiki sense bag. Sense
disambiguation (SD) is performed by summing the scores of (1)-(3), which represent the multilingual word
sense disambiguation similarity score and the word embedding and sense embedding scores of the ontology
bag and the wiki sense bag, to boost the disambiguation accuracy. The output of the WSD engine is the
disambiguated sense, which is converted into a neutral language code to be used for MT. Section 3.9
describes the formation of the neutral language code.
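The cosine similarity in (3) and the selection of the closest sense bag can be sketched as follows. The vectors and sense names are invented for illustration, and returning 0.0 for a zero vector is our own convention rather than something stated in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors, as in equation (3)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_u * norm_v)


def best_sense(context_vec, sense_bag_vecs):
    """Return the sense whose bag vector is most similar to the context vector."""
    return max(sense_bag_vecs, key=lambda s: cosine(context_vec, sense_bag_vecs[s]))


# Hypothetical context vector and sense bag vectors.
context_vec = [1.0, 0.2]
sense_bag_vecs = {"finance": [0.9, 0.1], "river": [0.1, 0.9]}
chosen = best_sense(context_vec, sense_bag_vecs)
```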


Figure 4. Wiki sense bag creation


3.9. Neutral language code 

After disambiguation, words are converted into a unique representation termed the neutral language
code. It is formed as a binary combination, a 30-bit unique code in which each group of bits represents
significant information about the disambiguated polysemous noun, as shown in Table 2. The neutral
language code is unique in the sense that it covers information beyond sense identification and parts of
speech. Results are presented in the next section.


Table 2. Neutral language code 
Noun    Code
Bank    0001000110101100101011110011xx
        000 - parts of speech
        0001 - unique identification
        1101 - type of noun
        011 - number
        001 - gender
        11110000 - tenses
        xx - reserved bits


4. RESULTS AND DISCUSSION 

The word sense disambiguation framework takes multilingual input, and evaluation is performed on a
manually created English corpus consisting of 25 polysemous nouns for the English lexical sample task.
Experiments were performed with 5000 instances, of which 70% were used for training and 30% for testing.
Test instances were also collected from various search engines and books. The overall accuracy observed
for multilingual word sense disambiguation is 40%, compared to 25% for monolingual word sense
disambiguation, an improvement of 15 percentage points. Table 3 presents a comparison of the two systems;
results are presented for 10 polysemous nouns. For simplicity, we consider two senses for each polysemous
noun. Observations and findings are presented in section 4.1.

Table 3. Monolingual versus multilingual word sense disambiguation


English    Sense             Accuracy in % for monolingual    Accuracy in % for multilingual
                             word sense disambiguation        word sense disambiguation

Chips      Silicon chip      25                               45
           Wafers            24                               40
Table      Furniture         30                               35
           Row/column        32                               43
Bat        Mammal            25                               45
           Sports            27                               45
Bank       Finance           32                               45
           Riverbank         32                               47
Tank       Military tank     25                               44
Plant      Industry plant    35                               44
           Tree              35                               47
Stock      Capital           30                               43
           Storage           29                               40
Palm       Hand              28                               44
           Name of tree      26                               43
Account    Bank account      35                               43
           Write up          35                               45
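The per-sense gains in Table 3 can be checked with a short calculation. The sketch below simply transcribes the rows of Table 3 and computes the gain in percentage points for each sense and the unweighted mean gain over the rows shown. Since the table lists only a subset of the evaluated nouns, these row means need not coincide with the overall accuracies quoted in the text.

```python
# Rows transcribed from Table 3: (word, sense, monolingual %, multilingual %).
ROWS = [
    ("Chips", "Silicon chip", 25, 45), ("Chips", "Wafers", 24, 40),
    ("Table", "Furniture", 30, 35), ("Table", "Row/column", 32, 43),
    ("Bat", "Mammal", 25, 45), ("Bat", "Sports", 27, 45),
    ("Bank", "Finance", 32, 45), ("Bank", "Riverbank", 32, 47),
    ("Tank", "Military tank", 25, 44),
    ("Plant", "Industry plant", 35, 44), ("Plant", "Tree", 35, 47),
    ("Stock", "Capital", 30, 43), ("Stock", "Storage", 29, 40),
    ("Palm", "Hand", 28, 44), ("Palm", "Name of tree", 26, 43),
    ("Account", "Bank account", 35, 43), ("Account", "Write up", 35, 45),
]

# Gain in percentage points for each (word, sense) row.
gains = {(word, sense): multi - mono for word, sense, mono, multi in ROWS}

# Unweighted means over the rows shown in the table.
mean_mono = sum(row[2] for row in ROWS) / len(ROWS)
mean_multi = sum(row[3] for row in ROWS) / len(ROWS)
mean_gain = mean_multi - mean_mono

smallest_gain = min(gains, key=gains.get)
```

Every row shows a positive gain for the multilingual system; the smallest gain among the listed senses is for the furniture sense of "table".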


4.1. Observations and findings 

The problem of similar scores faced in the monolingual approach was eliminated using multilingual
word sense disambiguation. The observed accuracy is 40%, which is far less than the baseline accuracy
observed for the most frequent sense. It is also observed that proper nouns such as Madhura and Shreyas
in our instances were not part of the dictionary definitions, so proper scores could not be generated for
them. Also, dictionary definitions, being short, lack strong clues, which reduces the disambiguation
accuracy.

Features of BabelNet senses are extracted from the synset (S), gloss of synset member (G),
hypernymy (H), hyponymy (HP), synset gloss of hypernymy-hyponymy relation (HG), holonymy (HO) and
gloss of holonymy (HOG). We tested these features on 2000 instances, and the results, obtained by taking
the maximum of the global scores received, are shown in Table 4. It is observed from Table 4 that
combining all the features of BabelNet senses gives an improved accuracy of 50%, a significant
improvement in the disambiguation process. The multilingual approach implements graph-based
disambiguation, and we observed that many clue words from the context were not in close proximity to the
ambiguous words. Many closely related words are at a distance from one another; this is one of the
important findings, as it results in a lower score, which affects the disambiguation process. Words in
similar contexts need to come closer to improve the accuracy. Word and sense embeddings are presented in
section 4.2.


Table 4. Synset dictionary framework 


Features              Global score    Accuracy in %
S                     0.0869          24
S+G                   0.1923          27
S+G+H                 0.1666          33
S+G+H+HP              0.0588          38
S+G+H+HP+HG           0.3333          42
S+G+H+HP+HG+HO        0.0526          47
S+G+H+HP+HG+HO+HOG    0.5238          50


4.2. Word and sense embeddings 

We evaluated our approach by testing the system on word and sense embeddings separately and
then combining the two results for the disambiguation process. Word embeddings are obtained from the raw
corpus using the gensim word2vec model. We compared our work with other state of the art methods in terms
of precision and recall, as shown in Table 5. It is observed that our approach with word embeddings came
close to the baseline accuracy and to the unsupervised most frequent sense (UMFS) approach. Our approach
gives a feasible way to extract predominant senses in an unsupervised setup. Our approach is domain
independent, so it can easily be adapted to a domain specific corpus: to get domain specific word and
sense embeddings, we simply have to run the word2vec program on the domain specific corpus. Also, our
approach is language independent and portable across mobile devices, smartphones being the most preferred
mode of communication. The conclusion is summed up in the next section.

Table 5. Performance comparison of sense embeddings with other methods


System                                             Precision    Recall

Most frequent sense baseline                       0.552        0.552
Lesk algorithm                                     0.097        0.053
Adapted Lesk                                       0.240        0.234
UMFS (Bhingardive)                                 0.433        0.432
Multilingual WSD with word and sense embeddings    0.489        0.489
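For reading Table 5, it may help to recall how precision and recall are usually computed in WSD evaluations: precision is the share of correct answers among the instances a system attempts, and recall is the share of correct answers among all instances. The sketch below assumes this usual convention; the paper does not spell out its definitions, and the gold labels and predictions are invented.

```python
def precision_recall(predictions, gold):
    """Return (precision, recall); a prediction of None means 'not attempted'."""
    attempted = 0
    correct = 0
    for instance, gold_sense in gold.items():
        predicted = predictions.get(instance)
        if predicted is None:
            continue  # the system gave no answer for this instance
        attempted += 1
        if predicted == gold_sense:
            correct += 1
    precision = correct / attempted if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold senses and system output for four instances.
gold = {1: "finance", 2: "river", 3: "finance", 4: "river"}
predictions = {1: "finance", 2: "finance", 3: None, 4: "river"}
precision, recall = precision_recall(predictions, gold)
```

Under this convention the two measures coincide when a system answers every instance and diverge when some instances are left unanswered.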


5. CONCLUSION 

In this research work, we presented a multilingual approach to word sense disambiguation and used
BabelNet as the multilingual lexicon for disambiguation. Multilingual word sense disambiguation exploits
a graph-based method to collect evidence from translations in various languages. We also explored the
synset dictionary framework by making use of features from the BabelNet dictionary. We created a separate
model for each ambiguous word sense and made an ensemble of the word2vec models for disambiguation using
word embeddings. Our research contribution includes sense bag creation using the ontological features of
the BabelNet lexicon and encyclopedic knowledge from Wikipedia. It is observed that multilingual word
sense disambiguation achieved better results than the monolingual system, as additional knowledge from
various languages helps to boost the accuracy. The results also show that our method of multilingual word
sense disambiguation with sense embeddings improves the accuracy of the system. The approach is open to
other languages. We will explore our approach for other parts of speech and other languages, especially
Indian languages such as Marathi, Hindi, and Bangla. In the near future we plan to create a generalized
sense representation for multiple languages so as to provide a general framework for knowledge rich
multilingual word sense disambiguation.